When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so that preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss or, even worse, reliability degradation of a datacenter. We further propose a two-stage framework, DC-Prophet, based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure and an F3-score of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.
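
For readers unfamiliar with the F3-score reported above, the following is a minimal sketch of the standard F-beta definition with beta = 3, which weights recall more heavily than precision. The function name `f_beta` and the precision/recall values are illustrative assumptions and are not taken from the paper's experiments.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    With beta = 3 this is the F3-score; recall counts beta**2 = 9 times
    as much as precision in the weighting.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both terms are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical predictors (numbers invented for illustration only):
# one misses many failures, the other raises more false alarms.
low_recall = f_beta(precision=0.9, recall=0.5)
high_recall = f_beta(precision=0.5, recall=0.9)

print(f"F3 with high precision, low recall: {low_recall:.3f}")
print(f"F3 with low precision, high recall: {high_recall:.3f}")
```

Under this metric the high-recall predictor scores higher, which presumably matches a setting where a missed machine failure is costlier than a false alarm.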